# Analyzing Performance

You've learned about three different TD control methods in this lesson. So, what do they have in common, and how are they different?

## Similarities

All of the TD control methods we have examined (Sarsa, Sarsamax, Expected Sarsa) converge to the optimal action-value function q_* (and so yield the optimal policy \pi_*) if:

  1. the value of \epsilon decays in accordance with the GLIE conditions, and
  2. the step-size parameter \alpha is sufficiently small.
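
For instance, here is a minimal sketch of one schedule that satisfies the GLIE conditions, together with the corresponding \epsilon-greedy action probabilities. The function names and the 1/i schedule are illustrative choices, not the only valid ones:

```python
import numpy as np

def glie_epsilon(i_episode):
    """Illustrative GLIE-compatible schedule: epsilon_i = 1 / i decays to zero,
    yet keeps every action selectable in every visited state at every episode."""
    return 1.0 / i_episode

def epsilon_greedy_probs(Q_s, epsilon):
    """Action probabilities of the epsilon-greedy policy for a single state,
    given the corresponding row Q_s of the action-value estimate."""
    nA = len(Q_s)
    probs = np.ones(nA) * epsilon / nA
    probs[np.argmax(Q_s)] += 1.0 - epsilon
    return probs
```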

## Differences

The differences between these algorithms are summarized below:

  • Sarsa and Expected Sarsa are both on-policy TD control algorithms. In this case, the same (\epsilon-greedy) policy that is evaluated and improved is also used to select actions.
  • Sarsamax (Q-learning) is an off-policy method, where the (greedy) policy that is evaluated and improved is different from the (\epsilon-greedy) policy that is used to select actions.
  • On-policy TD control methods (like Expected Sarsa and Sarsa) typically have better online performance than off-policy TD control methods (like Sarsamax).
  • Expected Sarsa generally achieves better performance than Sarsa.
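
To make the on-policy/off-policy distinction concrete, here is a minimal sketch of the three update targets for a single (state, action, reward, next state) transition. The function name, the `method` strings, and the tabular `Q` array layout are illustrative assumptions, not part of the lesson's code:

```python
import numpy as np

def td_target(method, Q, next_state, reward, gamma, epsilon, next_action=None):
    """TD target used by each control method.

    - Sarsa:          r + gamma * Q[s'][a'], where a' is the action the
                      epsilon-greedy behavior policy actually selects (on-policy).
    - Sarsamax:       r + gamma * max_a Q[s'][a]  (Q-learning; off-policy,
                      evaluates and improves the greedy policy).
    - Expected Sarsa: r + gamma * E_pi[Q[s'][.]], the expectation under the
                      epsilon-greedy policy (on-policy).
    """
    nA = Q.shape[1]
    if method == 'sarsa':
        return reward + gamma * Q[next_state][next_action]
    if method == 'sarsamax':
        return reward + gamma * np.max(Q[next_state])
    if method == 'expected_sarsa':
        probs = np.ones(nA) * epsilon / nA
        probs[np.argmax(Q[next_state])] += 1.0 - epsilon
        return reward + gamma * np.dot(probs, Q[next_state])
    raise ValueError(f"unknown method: {method}")

# In every case the estimate is then nudged toward the target:
#   Q[state][action] += alpha * (td_target(...) - Q[state][action])
```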

If you would like to learn more, you are encouraged to read Chapter 6 of the textbook (especially sections 6.4-6.6).

As an optional exercise to deepen your understanding, you are encouraged to reproduce Figure 6.4 of the textbook.

The figure shows the performance of Sarsa and Q-learning on the cliff walking environment for constant \epsilon = 0.1. As described in the textbook, in this case,

  • Q-learning achieves worse online performance (the agent collects less reward on average in each episode), but learns the optimal policy, and
  • Sarsa achieves better online performance, but learns a sub-optimal "safe" policy.

You should be able to reproduce the figure by making only small modifications to your existing code.
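
As a starting point, here is a minimal sketch of such an experiment. It assumes the CliffWalking-v0 environment from OpenAI Gym's toy_text suite and the classic Gym API (reset returns a state, step returns four values); the hyperparameter values match the figure's constant \epsilon = 0.1 but are otherwise illustrative:

```python
import numpy as np
import gym  # assumes the classic Gym API; newer Gym/Gymnasium releases return extra values

def eps_greedy_action(Q_s, epsilon):
    """Sample an action from the epsilon-greedy policy for one state."""
    if np.random.random() > epsilon:
        return int(np.argmax(Q_s))
    return np.random.randint(len(Q_s))

def run(method, n_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Run Sarsa or Sarsamax (Q-learning) on CliffWalking-v0 with constant
    epsilon, returning the sum of rewards collected in each episode."""
    env = gym.make('CliffWalking-v0')
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    episode_rewards = np.zeros(n_episodes)
    for i in range(n_episodes):
        state = env.reset()
        action = eps_greedy_action(Q[state], epsilon)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            episode_rewards[i] += reward
            if method == 'sarsa':
                # on-policy target: uses the action that will actually be taken next
                next_action = eps_greedy_action(Q[next_state], epsilon)
                target = reward + gamma * Q[next_state][next_action] * (not done)
            else:
                # 'sarsamax' (Q-learning): off-policy, greedy target
                target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            action = next_action if method == 'sarsa' else eps_greedy_action(Q[state], epsilon)
    return episode_rewards

# Average episode_rewards over many independent runs (and smooth over episodes)
# before plotting, as in the textbook figure.
sarsa_rewards = run('sarsa')
sarsamax_rewards = run('sarsamax')
```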